{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "76a5c2a0-e554-40e2-8f17-3911c2f1e5a2",
   "metadata": {},
   "source": [
    "<!-- Banner Image -->\n",
    "<img src=\"https://uohmivykqgnnbiouffke.supabase.co/storage/v1/object/public/landingpage/brevdevnotebooks.png\" width=\"100%\">\n",
    "\n",
    "<!-- Links -->\n",
    "<center>\n",
    "  <a href=\"https://brev.nvidia.com\" style=\"color: #06b6d4;\">Console</a> •\n",
    "  <a href=\"https://developer.nvidia.com/brev\" style=\"color: #06b6d4;\">Docs</a> •\n",
    "  <a href=\"/\" style=\"color: #06b6d4;\">Templates</a> •\n",
    "  <a href=\"https://discord.gg/NVDyv7TUgJ\" style=\"color: #06b6d4;\">Discord</a>\n",
    "</center>\n",
    "\n",
    "# Deploy the Efficient ViT Segmentation Models\n",
    "\n",
    "#### Segmentation in image processing\n",
    "Segmentation in the context of machine learning refers to the process of dividing or partitioning data into multiple segments or groups based on shared characteristics. This concept is widely applicable across various fields such as image processing, market analysis, natural language processing, and more. The primary goal of segmentation is to simplify or change the representation of data to make it more meaningful and easier to analyze. In image processing, segmentation involves dividing a digital image into multiple segments (sets of pixels, also known as image objects). The goal is to make the representation of an image more meaningful and easier to analyze by organizing its pixels into segments that are more homogeneous than the entire image.\n",
    "\n",
    "#### Efficient ViT\n",
    "EfficientViT is a new family of ViT models for efficient high-resolution dense prediction vision tasks. The core building block of EfficientViT is a lightweight, multi-scale linear attention module that achieves global receptive field and multi-scale learning with only hardware-efficient operations, making EfficientViT TensorRT-friendly and suitable for GPU deployment. In this notebook we demonstrate how to build each models engine files and compare each model in a unified gradio interface!\n",
    "\n",
    "#### Credits\n",
    "Efficient ViT was created by the [MIT-Han-Lab](https://github.com/mit-han-lab/efficientvit). Deployment of the notebook is powered by [Brev](brev.dev)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e669f6ea-1eeb-4a9a-be6a-780a348140f4",
   "metadata": {},
   "source": [
    "## Getting Started\n",
    "\n",
    "We start by installing the recommended dependancies from the repository README"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9908a34b-d388-4314-a494-ea4ff4b357db",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!wget -L https://raw.githubusercontent.com/mit-han-lab/efficientvit/master/requirements.txt\n",
    "!pip install openmpi\n",
    "!pip install -r requirements.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8fb7f218-31d3-42ad-a646-4305b9b168d4",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!git clone https://github.com/NVIDIA-AI-IOT/torch2trt\n",
    "!cd torch2trt && python setup.py install"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f97534b-b813-4f66-af8f-4f06145c422b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "!cd torch2trt && cmake -B build . && cmake --build build --target install && ldconfig"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbae0180-724d-4b30-8487-a81b7c700cde",
   "metadata": {},
   "source": [
    "## Download model checkpoints\n",
    "\n",
    "We pull each model checkpoint from the Huggingface repo and save them in the assets folder. Since we have access to an A100 and TensorRT, we will be using the ONNX formatted models and converting them to TRT engines. However this process can also be done with the PyTorch models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebaadeec-a9bc-43d3-857e-274e15560b8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import snapshot_download\n",
    "\n",
    "snapshot_download(\"mit-han-lab/efficientvit-sam\", local_dir=\"assets/checkpoints\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "688a37fc-9866-4eff-aa38-d6df928dc418",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a folder to store the models\n",
    "!mkdir -p assets/export_models/sam/tensorrt"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3ccba7c4-f1f4-4e12-8d1e-a0eb669687cb",
   "metadata": {},
   "source": [
    "Here we build each models encoder and decoder engine based on the type of segmentation. Notice that the L0, L1, and L2 models can ingest resolutions up to 512x512 and the XL0 and XL1 models can ingest resolutions up to 1024x1024. \n",
    "\n",
    "As a reminder, a TRT engine is essentially an optimized version of the model that is built to run on the current hardware. In our case they're optimized to run on an A100-40GB!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ba1d635-eaac-4d89-bd5a-517deee379d9",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\"Creating L0 TensorRT encoder with side length 512\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l0_encoder.onnx \\\n",
    "    --minShapes=input_image:1x3x512x512 \\\n",
    "    --optShapes=input_image:1x3x512x512 \\\n",
    "    --maxShapes=input_image:4x3x512x512 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l0_encoder.engine\n",
    "\n",
    "print(\"Creating L0 TensorRT point decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l0_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --maxShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l0_point_decoder.engine\n",
    "\n",
    "print(\"Creating L0 TensorRT box decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l0_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --maxShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l0_box_decoder.engine\n",
    "\n",
    "print(\"Creating L0 TensorRT full image segmentation decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l0_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:64x1x2,point_labels:64x1 \\\n",
    "    --maxShapes=point_coords:128x1x2,point_labels:128x1 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l0_full_img_decoder.engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "00f59df1-1835-4f95-ae48-68cd6d128b0f",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\"Creating L1 TensorRT encoder with side length 512\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l1_encoder.onnx \\\n",
    "    --minShapes=input_image:1x3x512x512 \\\n",
    "    --optShapes=input_image:1x3x512x512 \\\n",
    "    --maxShapes=input_image:4x3x512x512 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l1_encoder.engine\n",
    "\n",
    "print(\"Creating L1 TensorRT point decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l1_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --maxShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l1_point_decoder.engine\n",
    "\n",
    "print(\"Creating L1 TensorRT box decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l1_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --maxShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l1_box_decoder.engine\n",
    "\n",
    "print(\"Creating L1 TensorRT full image segmentation decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l1_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:64x1x2,point_labels:64x1 \\\n",
    "    --maxShapes=point_coords:128x1x2,point_labels:128x1 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l1_full_img_decoder.engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01701385-7b35-438c-9dc2-69fbb6d41896",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\"Creating L2 TensorRT encoder with side length 512\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l2_encoder.onnx \\\n",
    "    --minShapes=input_image:1x3x512x512 \\\n",
    "    --optShapes=input_image:1x3x512x512  \\\n",
    "    --maxShapes=input_image:4x3x512x512  \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l2_encoder.engine\n",
    "\n",
    "print(\"Creating L2 TensorRT point decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l2_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --maxShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l2_point_decoder.engine\n",
    "\n",
    "print(\"Creating L2 TensorRT box decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l2_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --maxShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l2_box_decoder.engine\n",
    "\n",
    "print(\"Creating L2 TensorRT full image segmentation decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/l2_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:64x1x2,point_labels:64x1 \\\n",
    "    --maxShapes=point_coords:128x1x2,point_labels:128x1 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/l2_full_img_decoder.engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bb2c6ce7-2ded-4173-9008-e95bf7e3d9d2",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\"Creating XL0 TensorRT encoder with side length 1024\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/xl0_encoder.onnx \\\n",
    "    --minShapes=input_image:1x3x1024x1024 \\\n",
    "    --optShapes=input_image:1x3x1024x1024 \\\n",
    "    --maxShapes=input_image:4x3x1024x1024 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/xl0_encoder.engine\n",
    "\n",
    "print(\"Creating XL0 TensorRT point decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/xl0_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --maxShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/xl0_point_decoder.engine\n",
    "\n",
    "print(\"Creating XL0 TensorRT box decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/xl0_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --maxShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/xl0_box_decoder.engine\n",
    "\n",
    "print(\"Creating XL0 TensorRT full image segmentation decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/xl0_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:64x1x2,point_labels:64x1 \\\n",
    "    --maxShapes=point_coords:128x1x2,point_labels:128x1 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/xl0_full_img_decoder.engine"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a986537-a49b-4f96-9abf-6a8afe3abd93",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "print(\"Creating XL1 TensorRT encoder with side length 1024\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/xl1_encoder.onnx \\\n",
    "    --minShapes=input_image:1x3x1024x1024 \\\n",
    "    --optShapes=input_image:1x3x1024x1024 \\\n",
    "    --maxShapes=input_image:4x3x1024x1024 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/xl1_encoder.engine\n",
    "\n",
    "print(\"Creating XL1 TensorRT point decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/xl1_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --maxShapes=point_coords:1x16x2,point_labels:1x16 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/xl1_point_decoder.engine\n",
    "\n",
    "print(\"Creating XL1 TensorRT box decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/xl1_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --maxShapes=point_coords:16x2x2,point_labels:16x2 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/xl1_box_decoder.engine\n",
    "\n",
    "print(\"Creating XL1 TensorRT full image segmentation decoder\")\n",
    "!trtexec --onnx=assets/checkpoints/onnx/xl1_decoder.onnx \\\n",
    "    --minShapes=point_coords:1x1x2,point_labels:1x1 \\\n",
    "    --optShapes=point_coords:64x1x2,point_labels:64x1 \\\n",
    "    --maxShapes=point_coords:128x1x2,point_labels:128x1 \\\n",
    "    --fp16 \\\n",
    "    --saveEngine=assets/export_models/sam/tensorrt/xl1_full_img_decoder.engine"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9da22fb-945e-479d-bd6f-963a5c9d1d6e",
   "metadata": {},
   "source": [
    "## Build the gradio web server to host each model\n",
    "\n",
    "Now that we have the TRT engine files, we can leverage the EfficientViT inference code and launch our own gradio server to run the segmentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2b195567-53d9-48e5-a64b-4ca3c291d48c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# solves an cv2 import bug\n",
    "!pip install opencv-python-headless"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3e9d621-b470-4d9f-aa41-0cfcace5222b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone https://github.com/mit-han-lab/efficientvit.git"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ed2c54b9-4e8c-477c-a3f1-e91e3c46132d",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mv assets/export_models/ efficientvit/assets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "70d67540-83b3-4031-9c53-3db5fc75631b",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip show tensorrt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ec3266ee-85cb-4bfb-9abe-eaa4f5980921",
   "metadata": {},
   "outputs": [],
   "source": [
    "!cd /root/verb-workspace/efficientvit && python -m demo.sam.gradio_web_server --runtime tensorrt"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
